Abstract

This paper describes a hardware test platform designed to
implement adaptive lattice filters in real-time. To achieve real-time
processing speeds, algorithm complexity was accommodated by
custom designing the computation engines with respect to the
lattice data flow. Execution speeds of the computation engines
were dramatically increased by a memory architecture that supports
efficient addressing and by providing a floating-point ALU with
numerous data paths and efficient implementation of division.
Performance is further enhanced by pipelining multiple computation
engines. In addition, the architecture is flexible enough to support
other filter structures and to allow observation of filter variables as
they adapt. With this system, various adaptive filter algorithms are
being tested in real-time implementations of Adaptive Line
Enhancers (ALEs) and Adaptive Noise Cancelers (ANCs) in order to
characterize their performance and behavior, especially long term
stability and the ability to track non-stationary signals.

I.  Introduction 


Adaptive  lattice  algorithms  have  been  expected  to  offer  a 
number  of  advantages  over  conventional  LMS  transversal  algorithms 
including:  faster  rate  of  convergence,  modular  structure, 
insensitivity  to  variations  in  the  eigenvalue  spread  of  the  input 
correlation matrix, and automatic system order detection [1,2].
However,  the  use  of  adaptive  lattice  filters  for  real-time  signal 
processing  has  been  limited.  In  part  this  is  due  to  their 
computational  cost  and  complexity.  This  paper  describes  the  design 
of  a  hardware  test  platform,  called  the  Lattice  Development  System 
(LDS), designed and built by the U.S. Naval Ocean Systems Center
(NOSC)  to  implement  adaptive  lattice  filters  for  real-time 
applications.  The  design,  while  tailored  to  implement  adaptive  lattice 
filters  efficiently,  is  flexible  enough  to  support  other  structures  such 
as  adaptive  transversal  filters  for  comparative  performance 
evaluations. 

The  recursive  nature  of  adaptive  filters  makes  their 
implementation in real-time hardware an extremely interesting and
challenging  research  field.  Time-domain  adaptive  algorithms 
generally  require  that  their  filtered  output  be  used  in  updating  the 
filter's  coefficients  before  the  next  input  sample  is  processed.  Thus, 
the  total  processing  latency  must  be  less  than  one  sample  period. 
Unfortunately,  the  coefficient  updating  can  often  dominate  the  total 
processing  time  thereby  limiting  the  adaptive  filter's  potential 
applications. Multiprocessing techniques used for performance
enhancement of non-adaptive filters cannot be directly applied to
adaptive filters because adaptive filters lack a single computational
form. For instance, non-adaptive convolution is composed of only
a sum-of-products. Adaptive filters usually possess a number of
computational forms and also require several different modes of
operation such as initialization, adaptation, and order expansion and
contraction.

Figure 1 shows the tradeoffs incurred by different methods of
implementing adaptive filters. Usually, performance and design
complexity are traded against flexibility. The measure of
performance in real-time systems is the maximum continuous
sampling rate supported by the hardware; the measure of flexibility
is how easily modifications can be accomplished. Examples of
features one may wish to modify are: update algorithm, filter
structure, filter configuration, and filter parameters such as order
and time constants.


Implementation methods, ordered from most flexible (lowest sampling
rate) to least flexible (highest sampling rate):

•  high-level program running on general purpose computer

•  single programmable computation engine

•  multiple programmable computation engines working in parallel

•  direct implementation using custom digital VLSI design

•  direct implementation using custom analog/digital hybrid design


Figure 1: Adaptive Filter Implementation Methods and their
Tradeoffs.

The  LDS  was  designed  as  an  adaptive  filter  test  platform  to 
characterize  the  performance  and  behavior  of  both  adaptive  lattice 
algorithms  and  adaptive  transversal  algorithms  in  real-time 
applications.  Of  particular  interest  are  the  long  term  stability  and 
the  ability  to  track  non-stationary  signals  as  a  function  of  filter 
parameters. To meet the combined needs of high throughput and
flexibility, the LDS was built as a linear pipelined array of custom
designed, programmable computation engines. Each engine is
optimized for implementing adaptive lattice filters using a single
32-bit floating-point ALU with provisions for floating-point division.
The LDS system can be configured with up to 10 engines, each
having sufficient memory to store the variables for a 1024 stage
lattice filter. A system with a full complement of ten computation
engines is capable of sustaining the computation of a 1024 stage
recursive least squares lattice (RLSL) filter [10,11] at a 1.2 kHz
real-time sample rate.
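
The figures quoted here imply a rough per-stage cycle budget. The sketch below is only back-of-the-envelope arithmetic: it uses the 8 MHz system clock described in Section II, and it assumes that the stages are divided evenly among the engines and that transfer overhead is ignored, neither of which is stated in the paper. The function name is illustrative.

```python
import math

def cycles_per_stage(clock_hz, sample_hz, total_stages, engines):
    """Upper bound on clock cycles available per lattice stage per sample."""
    stages_per_engine = math.ceil(total_stages / engines)  # assumed even split
    cycles_per_sample = clock_hz / sample_hz               # one sample period
    return stages_per_engine, cycles_per_sample / stages_per_engine

stages, budget = cycles_per_stage(8e6, 1.2e3, 1024, 10)
print(stages, round(budget, 1))   # 103 64.7
```

Under these assumptions each engine handles about 103 stages and has roughly 65 clock cycles for each stage update.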

Other adaptive lattice filter implementations, particularly those in
custom VLSI, have been based on linear pipelined processing arrays
consistent with the lattice structure [3-5]. One implementation
made use of a switched capacitor filter in an analog/digital hybrid
approach [6]. Recently, a vectorized adaptive lattice was proposed
which allows for even higher degrees of parallelism [7-9]. All of
these take advantage of the modularity provided by the local (single
lattice stage) error feedback of the lattice.

Section II of this paper is a high level description of the LDS. It
includes a brief description of the system's three main hardware
functional units, plus the Man-Machine Interface software and other
software support tools. In addition, Section II describes the
implementation of adaptive filters in a multiple engine LDS
configuration. Section III provides more detailed information and
insight into the design of the functional unit that is the processing
heart of the LDS: the Computation Engine. A sample of the
capabilities of the LDS is presented in Section IV.


II.  Hardware  System  Overview 


The  LDS  has  three  separate  functional  units:  the  System  Control, 
the  Analog-Digital  Interface,  and  an  array  of  Computation  Engines. 
System  performance  is  increased  by  using  multiple  computation 
engines  in  a  linear  pipelined  array.  The  functional  units  are 
accessed and configured by a single board mini-computer running
a menu driven man-machine interface (MMI). Figure 2 shows the
interconnection and the user interface via the mini-computer and a
terminal. Once configured by the user, the LDS will run in real-time
until stopped by the user. It can easily be reconfigured to compute
different adaptive filter algorithms, modify filter parameters, change
sample frequency, or change between ALE and ANC. Figure 3 shows
the ALE and ANC adaptive filter configurations and the associated
nomenclature for the reference, x(n), primary, d(n), filter output,
y(n), and error output, e(n), signals.


(Accepted for Publication: PROCEEDINGS OF THE ASILOMAR CONFERENCE
ON CIRCUITS, SYSTEMS, AND COMPUTERS, Pacific Grove, CA, Nov 1991)


Figure 2: Block Diagram of the NOSC Adaptive Lattice Development System (LDS).


Figure  3:  Adaptive  Line  Enhancer  (ALE)  and  Adaptive  Noise 
Canceler  (ANC)  Filters. 

Real-Time  Operation 

The  System  Control  unit  runs  a  concurrent  operating  system 
which initializes all actions in the LDS. Control data is passed by the
MMI to the System Control unit to configure or reconfigure the
system. The control unit first downloads the microcode for each
engine and then initializes each engine's state. Once the system is
completely configured, it will perform the following tasks repeatedly
when commanded to run:

1.  prompts the A-D Interface unit to acquire 16-bit
quantized data.

2.  accepts data from the A-D Interface unit and places it
in local RAM.

3.  passes input data to the engines for filtering.

4.  initiates data transfers between Computation Engines.

5.  accepts filter output data and other outputs from the
engines.

6.  sends engine outputs to the A-D Interface unit for
conversion to analog output.

7.  loops to 1.
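
The task list above can be sketched as a simple software loop. This is a schematic model only, not the System Control unit's operating system: the `Engine` class, its `inbox`/`outbox` fields, and the placeholder arithmetic in `update` are all hypothetical stand-ins introduced for illustration.

```python
class Engine:
    """Hypothetical stand-in for one Computation Engine."""

    def __init__(self):
        self.inbox = 0.0    # variable received from the upstream engine
        self.outbox = 0.0   # variable offered to the downstream engine

    def update(self, sample):
        # Placeholder arithmetic only; a real engine would run its microcode here.
        self.outbox = sample + self.inbox
        return self.outbox


def run_cycle(acquire, engines, emit, ram):
    """One pass through tasks 1-6; the caller loops (task 7)."""
    ram.append(acquire())                           # tasks 1-2: acquire, store in RAM
    sample = ram[-1]
    outputs = [e.update(sample) for e in engines]   # task 3: pass input to engines
    for up, down in zip(engines, engines[1:]):      # task 4: engine-to-engine transfer
        down.inbox = up.outbox
    emit(outputs[-1])                               # tasks 5-6: collect, send to output
    return outputs


samples = iter([1.0, 2.0])
sent, ram = [], []
engines = [Engine(), Engine()]
for _ in range(2):
    run_cycle(lambda: next(samples), engines, sent.append, ram)
print(sent)   # [1.0, 3.0]
```

The second output differs from the first because the downstream engine only sees its neighbor's variable on the following cycle, which mirrors the one-sample skew between pipelined engines.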

All functional units are synchronous and run from a single 8 MHz
clock distributed throughout the LDS. Engine operations are started
synchronously but proceed independently, so they can run different
microcode  and  may  terminate  operations  independently.  Once  all 
engines  terminate  operations  in  a  given  filter  update  cycle,  the 
System  Control  unit  assumes  control.  The  computation  cycle 
repeats  with  synchronous  starts  on  command  from  the  System 
Control  unit  and  asynchronous  terminations  dependent  upon 
computation  requirements. 

Local  RAM  on  the  System  Control  unit  is  designed  to  handle 
multiple  circular  buffers  so  that  input  data  (reference  and/or 
primary) can be delayed up to a combined maximum of 16K data
samples before being sent to the engines. This supports many
filter configurations including the ALE, where the reference data is a
delayed  version  of  the  primary  data.  Another  feature  of  the  LDS  is  a 
user  definable  sampling  frequency. 
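
As an illustration of how a circular buffer produces the ALE reference from the primary input, the following minimal sketch delays a sample stream by a fixed number of samples. The class and function names are hypothetical, and the 16K capacity check simply mirrors the limit quoted above; the actual buffer management in the System Control unit is not described in the paper.

```python
class DelayLine:
    """Circular buffer that returns the sample written `delay` pushes ago."""

    def __init__(self, delay, capacity=16 * 1024):
        if not 0 < delay <= capacity:
            raise ValueError("delay must be between 1 and capacity")
        self.buf = [0.0] * delay   # only `delay` slots are needed
        self.pos = 0

    def push(self, sample):
        delayed = self.buf[self.pos]           # oldest stored sample
        self.buf[self.pos] = sample            # overwrite it with the newest
        self.pos = (self.pos + 1) % len(self.buf)
        return delayed


def ale_inputs(primary, delay):
    """ALE wiring: reference x(n) is the primary d(n) delayed by `delay`."""
    line = DelayLine(delay)
    reference = [line.push(d) for d in primary]
    return reference, list(primary)


x, d = ale_inputs([1.0, 2.0, 3.0, 4.0, 5.0], delay=2)
print(x)   # [0.0, 0.0, 1.0, 2.0, 3.0]
```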

Filter  Implementations  with  a  Multiple  Engine  LDS 

Both  adaptive  lattice  and  adaptive  transversal  filters  can  benefit 
from  a  multiple  engine  LDS  configuration.  The  lattice  filter's  order 
recursive  variables  are  passed  from  stage  to  stage  in  sequential 
order.  This  can  result  in  extremely  long  processing  times  when  long 
filter  lengths  are  used.  The  total  processing  time  can  be  reduced  by 
a factor of O(P) by using P pipelined computation engines. Since
each engine must be pipelined, successive engines will be
processing data that is one sample period earlier in time than the
engine which supplies its passed variables. An application's
required sample rate determines both the number of engines used
and the maximum number of stages computed on each engine.
Figure 4 shows an example of a three engine LDS computing a
6-stage adaptive lattice filter. At the completion of a lattice update
cycle, all order recursive variables are passed between engines via
a uni-directional local bus in a synchronous fashion.
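
The pipelined arrangement can be made concrete with a small scheduling sketch. It assumes that stages are assigned to engines in contiguous, equal-sized blocks (the paper does not state the assignment rule) and shows which sample each engine is working on during sample period n. The function name is illustrative.

```python
def pipeline_schedule(total_stages, engines, n):
    """Return (engine, first_stage, last_stage, sample_index) for period n."""
    per_engine = -(-total_stages // engines)   # ceiling division
    plan = []
    for k in range(engines):
        first = k * per_engine
        last = min(first + per_engine, total_stages) - 1
        if first > last:
            break                               # no stages left for this engine
        plan.append((k, first, last, n - k))    # engine k lags by k sample periods
    return plan


for row in pipeline_schedule(6, 3, n=10):
    print(row)
# (0, 0, 1, 10)
# (1, 2, 3, 9)
# (2, 4, 5, 8)
```

For the three engine, 6-stage case, each engine computes two stages, and the last engine is finishing the sample that entered the first engine two periods earlier.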

Adaptive transversal filters can be implemented on a multiple
engine LDS in several ways, even though multiprocessing with them
is less straightforward than for adaptive lattice filters due to the
global error feedback for coefficient updating [12]. One of the
more efficient approaches makes use of the multiple engines
configured in a uni-directional data flow ring, as was proposed by
Miller et al. [13]. As an example of this approach, Figure 5 shows
the computation of an 8-weight adaptive transversal filter on a four
engine LDS. Each engine is responsible for only those calculations
involving a section of the complete filter's weights. The process
proceeds by each engine computing the convolution sum of its
section of the filter. A series of synchronous data passing and
summation steps is then performed until each engine has its own
identical copy of the filtered output. Next, each engine generates its
own identical error term from a broadcast desired signal and
updates its own set of weights via an adaptive algorithm such as
least mean square (LMS) [14]. The method effectively solves the
global feedback problem by generating the same error in every
processor. Each engine has sufficient memory to store the
variables for a 4096-weight LMS filter.
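
A minimal software model of the ring approach is sketched below. It is not the LDS microcode: the function names are hypothetical, the weight update uses the common form w <- w + mu*e*x (step size conventions vary), and in this floating-point model each engine adds the partial sums in a different order, so the copies of y(n) agree only to within rounding.

```python
def ring_allsum(partials):
    """Circulate partial sums around a one-way ring until every node holds the total."""
    p = len(partials)
    acc = list(partials)     # running total held by each engine
    msg = list(partials)     # value each engine sends to its neighbor
    for _ in range(p - 1):
        msg = [msg[(i - 1) % p] for i in range(p)]   # engine i receives from i-1
        acc = [a + m for a, m in zip(acc, msg)]
    return acc


def lms_step_partitioned(weights, taps, d, mu):
    """One LMS update with the weight vector split into per-engine sections."""
    partials = [sum(w * x for w, x in zip(ws, xs)) for ws, xs in zip(weights, taps)]
    outputs = ring_allsum(partials)                  # each engine's copy of y(n)
    for ws, xs, y in zip(weights, taps, outputs):
        e = d - y                                    # error formed locally per engine
        for j, x in enumerate(xs):
            ws[j] += mu * e * x
    return outputs[0], d - outputs[0]


# 8 weights on 4 engines, 2 weights each, as in the Figure 5 example.
weights = [[0.0, 0.0] for _ in range(4)]
taps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y1, e1 = lms_step_partitioned(weights, taps, d=1.0, mu=0.01)
y2, e2 = lms_step_partitioned(weights, taps, d=1.0, mu=0.01)
print(y1, e1)
```

With P engines the ring needs P-1 passing steps, after which no engine has to wait on a central error computation.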


Figure 5: Calculation of an 8-stage Adaptive Transversal Filter in
a Four Engine LDS. (The primary input d(n) is broadcast to all engines.)

Analysis  Tools 

The  LDS  software  provides  the  user  with  many  useful  features 
for  analyzing  adaptive  filters.  Specific  filter  coefficients  can  be 
selected  for  observation  as  they  adapt  in  time,  or  a  snapshot  of  all 
coefficients  can  be  taken  at  a  specific  time  period.  The  coefficients 
are  transferred  to  data  files  on  the  mini-computer's  hard  disc  from 
which  further  analysis  can  be  done  off-line.  The  system  also  has 
the  capability  of  inputting  data  directly  from  a  data  file  or  outputting 
filtered data directly to a file.

III.  Computation  Engine  Design 


The Computation Engine is the processing heart of the LDS. It
was custom designed to efficiently implement adaptive lattice filters
and is composed of two main parts: the microsequencer and the
microengine. The engine's operation is controlled by up to 2K
microwords downloaded from the System Control unit during the
LDS's initialization. The equations that describe adaptive lattice
algorithms are structured into groups called stages (see Figure 6).
Each stage can have variables which are updated entirely by
time-recursions, entirely by order-recursions, or by a combination
of both time and order-recursions. Adaptive lattice algorithms such
as the RLSL [10,11] and the stochastic gradient lattice [15] use
both time and order recursions to save computations. In addition,
these adaptive lattice algorithms require multiple divisions per
stage. The following features were incorporated in order to achieve
real-time computation speeds for adaptive lattice filters:

•  use  of  a  32-bit  floating-point  processing  chip  with  a 
latency  of  only  one  clock  cycle, 

•  efficient  implementation  of  floating-point  division, 

•  separate  memories  for  time-recursion  and 
order-recursion  variables, 


•  implementation  of  a  floating-point  &  integer 
comparator, 

•  provision of many data paths to support all of the
above,

•  extension of data paths across multiple computational
engines,

•  increased data bandwidth by using uni-directional local
busses to link neighboring engines,

•  use of a simple microsequencer and a horizontal
microword.

The utility of these features is described in the next two sections.


Figure 6: Data Flow for Single Adaptive Lattice Stage
Computations.
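
To illustrate the split between order-recursions and time-recursions within one stage, the sketch below implements a single stage of a normalized gradient lattice in one common textbook style. It is an assumption-laden example, not the recursions of [15] or of the RLSL: the sign convention, the parameters `mu`, `beta`, and `eps`, and the class name are all chosen for illustration. Note the one division per update, which is why efficient division matters.

```python
class LatticeStage:
    """One adaptive lattice stage (illustrative normalized-gradient form)."""

    def __init__(self, mu=0.5, beta=0.99, eps=1e-6):
        self.k = 0.0            # reflection coefficient (time-recursion variable)
        self.power = eps        # weighted residual power (time-recursion variable)
        self.b_delayed = 0.0    # backward residual from the previous sample
        self.mu, self.beta = mu, beta

    def update(self, f_in, b_in):
        b_prev = self.b_delayed
        # Order-recursions: residuals handed on to the next stage.
        f_out = f_in - self.k * b_prev
        b_out = b_prev - self.k * f_in
        # Time-recursions: state kept in this stage for the next sample.
        self.power = self.beta * self.power + f_in ** 2 + b_prev ** 2
        self.k += self.mu * (f_out * b_prev + b_out * f_in) / self.power
        self.b_delayed = b_in
        return f_out, b_out


stage = LatticeStage()
for _ in range(2000):
    f, b = stage.update(1.0, 1.0)   # constant input is perfectly predictable
print(abs(f) < 0.05)                # forward residual has been driven toward zero
```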

Microengine  Arithmetic  Structure  and  Data  Paths 

Figure 7 shows the data paths in the microengine. The
Advanced Micro Devices AM29325 [16] was the floating-point
processor chosen for two main reasons: it can compute arithmetic
operations in a single clock cycle and it supports fast inversion
instructions. A short pipeline depth is desirable when implementing
tightly recursive algorithms, such as RLSL, which often require the
result of one computation for the very next computation. Inversions
are computed with a Newton-Raphson iteration technique using a
first approximation seed sufficiently large to require only one
iteration (a total of three floating-point operations) to achieve full
32-bit floating-point precision. This technique has quadratic
convergence properties and the only additional hardware needed is
an inverse seed PROM. In parallel with the AM29325 is a
floating-point & integer comparator which supports such things as
variable bounding and lattice filter order control.
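
The seeded Newton-Raphson inversion can be sketched in software as follows. The iteration x1 = x0*(2 - b*x0) squares the relative error of the seed, so a seed good to about 13 bits yields about 26 bits after one step, which is enough for a 24-bit mantissa. The table size `SEED_BITS`, the midpoint seeding rule, and the function name are assumptions made for this example; the paper does not give the PROM dimensions or contents, and Python computes in double precision rather than on the AM29325.

```python
import math

SEED_BITS = 12   # assumed table index width; not taken from the paper

# Seed table: reciprocal of the midpoint of each mantissa interval in [0.5, 1).
SEED_TABLE = [1.0 / (0.5 + (i + 0.5) / (2 << SEED_BITS))
              for i in range(1 << SEED_BITS)]


def reciprocal(b):
    """Approximate 1/b with a table seed and one Newton-Raphson step."""
    if b == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(b))                  # abs(b) = m * 2**e, 0.5 <= m < 1
    i = int((m - 0.5) * (2 << SEED_BITS))      # leading mantissa bits pick the seed
    x0 = math.ldexp(SEED_TABLE[i], -e)         # seed for 1/abs(b)
    t = abs(b) * x0                            # operation 1: multiply
    t = 2.0 - t                                # operation 2: subtract from two
    x1 = x0 * t                                # operation 3: multiply
    return math.copysign(x1, b)


for value in (3.0, 0.1, -7.5, 1e10):
    print(value, abs(reciprocal(value) * value - 1.0) < 2.0 ** -24)
```

A full division a/b then costs one further multiply, a * reciprocal(b).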

Numerous data paths allow for the efficient moving of data
between the AM29325, registers, and memory. All data paths and
operations are controlled by individual terms in the microcode. A
96-bit horizontal microword is used which allows complete flexibility
and eliminates decoding time. One additional data path not shown
in Figure 7, but frequently used, is an internal wrap-around data path
inside the AM29325. The microengine is programmed with a user
defined assembler, aided by using a reservation chart to keep track
of the multiple data paths and concurrent operations.

Automatic system order detection is a unique capability of
adaptive lattice filters. Control of the length of the lattice filter is
necessary to minimize filter generated noise and ensure stability of
the final error, e(n), and filter output, y(n). Rapid variation of filter
order is possible with non-stationary signals, so order control must
function in real-time. The forward and backward prediction
residuals of each successive stage must be compared to a
threshold as they are computed to determine if another stage is
needed (or allowed) for the current update. The value of the
threshold is defined by the user, based on the engine's arithmetic
precision and the exponential windowing of the data. Control must
span across engine boundaries and accommodate the time
displacement at those boundaries. This is done with a special flag
set by each engine which is checked and cleared on the next
iteration by its downstream target engine.
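
A hedged sketch of the threshold test follows. The paper states only that the residuals are compared to a threshold stage by stage; the specific rule used here (stop extending the filter once both residual magnitudes fall below the threshold) and the function name are assumptions for illustration, and the cross-engine signaling is not modeled.

```python
def stages_to_compute(residuals, threshold, max_order):
    """Return how many lattice stages to run for the current update.

    `residuals` yields (forward, backward) prediction residuals stage by stage.
    """
    order = 0
    for forward, backward in residuals:
        if order >= max_order:
            break
        order += 1
        # Assumed rule: stop once both residuals fall below the threshold.
        if abs(forward) < threshold and abs(backward) < threshold:
            break
    return order


example = [(0.9, 0.8), (0.3, 0.4), (1e-7, 2e-7), (1e-7, 1e-7)]
print(stages_to_compute(example, threshold=1e-6, max_order=1024))   # 3
```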


Figure 7: The Computation Engine's Microengine Data Path.

Microengine Memory Structure

Two separate data memories are available to the AM29325: one
is tailored to support time-recursions, the other order-recursions.
The  Permanent  Memory  (PM)  is  addressable  in  blocks 
corresponding  to  lattice  stages  for  ease  of  storage  and  retrieval 
during  time-recursions.  Eight  variables  can  be  stored  in  each  PM 
block  as  seen  in  Figure  8.  The  structure  is  1K-stages  x  8-words  x 
32-bits  allowing  a  maximum  of  a  1024  stage  lattice  filter  per 
engine.  The  PM  uses  a  block  addressing  scheme  where  each  block 
is selected by a stage number counter. The block addressing
greatly  simplifies  the  engine's  microcode  during  the  computation  of 
a  lattice  stage  without  sacrificing  the  use  of  simple  addressing 
schemes  for  transversal  filters. 
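
One way to picture the block addressing is as a stage counter supplying the upper address bits and a 3-bit word offset supplying the lower bits. The flat address layout below is an assumption for illustration; the paper gives the 1K x 8 x 32-bit organization but not the physical address mapping.

```python
WORDS_PER_BLOCK = 8    # variables per lattice stage (from the text)
NUM_BLOCKS = 1024      # stages per engine (from the text)


def pm_address(stage, word):
    """Map (stage counter, word offset) to a flat word address (assumed layout)."""
    if not 0 <= stage < NUM_BLOCKS:
        raise IndexError("stage out of range")
    if not 0 <= word < WORDS_PER_BLOCK:
        raise IndexError("word out of range")
    return (stage << 3) | word   # stage selects the block, low 3 bits the word


print(pm_address(0, 7), pm_address(1, 0), pm_address(1023, 7))   # 7 8 8191
```

With this arrangement the microcode for a stage only names word offsets 0 to 7, and advancing the stage counter moves the same code to the next block.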

The Scratch Pad Memory (SPM), in addition to being a temporary
storage area for data transfers and intermediate results, can be
structured to operate in a "ping-pong" fashion between the upper
and lower halves to allow the overwriting of order-recursion
variables that are no longer needed. During the computation of a
lattice filter, variables are passed between the two SPM halves with
the current stage overwriting variables passed from a previous
stage. Each SPM half contains 32-words x 32-bits of dual-ported
memory. With current/next addressing, there is no need to keep
track of the absolute physical addresses being used, which saves
machine cycles and engine instructions. In addition, it is easier to
conceptualize order recursion variables as current and next (Figure
6). Special address control logic switches the physical location of
the order recursion variables in a manner that is transparent to the
software (and the programmer). The result is a very efficient use of
memory space and a simple, efficient addressing scheme. For the
computation of transversal filters, the SPM is configured by software
as a single 64-word x 32-bit dual-port memory (see Figure 9).
Memory locations in both the SPM and the PM can be accessed in a
single clock cycle at the same time.
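
The current/next addressing can be modeled in a few lines. In this sketch the class and method names are hypothetical, and the swap is modeled by toggling a base address with an exclusive-or; the paper says only that special address control logic switches the halves transparently.

```python
class ScratchPad:
    """Two 32-word halves addressed logically as 'current' and 'next'."""

    HALF = 32

    def __init__(self):
        self.mem = [0.0] * (2 * self.HALF)
        self.current_base = 0   # physical base address of the 'current' half

    def _check(self, i):
        if not 0 <= i < self.HALF:
            raise IndexError("word offset out of range")

    def read_current(self, i):
        self._check(i)
        return self.mem[self.current_base + i]

    def write_next(self, i, value):
        self._check(i)
        self.mem[(self.current_base ^ self.HALF) + i] = value

    def swap(self):
        """Advance one lattice stage: the 'next' half becomes 'current'."""
        self.current_base ^= self.HALF


spm = ScratchPad()
spm.write_next(0, 1.5)       # stage m writes a variable for stage m+1
spm.swap()
print(spm.read_current(0))   # 1.5
spm.write_next(0, 2.5)       # overwrites the half that is no longer needed
spm.swap()
print(spm.read_current(0))   # 2.5
```

The code for every stage uses the same logical offsets, so no physical address bookkeeping appears in the per-stage program.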


Figure 8: Engine Board Permanent Memory Configuration
(1K x 32-bits x 8).

Figure 9: Engine Board Scratch Pad Memory Configuration
(64 x 32-bits or two 32 x 32-bit halves).


IV.  LDS  Test  Data 


As of this writing the following adaptive filter algorithms have
been programmed: LMS transversal [14], Block LMS transversal
[17], Stochastic Gradient Lattice [15], Recursive Least Squares
Lattice [10,11], and Direct Coefficient Updating RLSL [18]. Testing
with the LMS algorithm began in March of 1989, while testing with the
RLSL began in June of 1989. The other algorithms were
included in 1990. A photograph of the LDS is provided in Figure 10.
Figure 11 is a photograph of the two boards comprising a
Computation Engine.

As an example of LDS operation, the time series outputs of LMS
and RLSL ANCs are shown for comparison in Figures 12 and 13,
respectively. In both cases a single sinusoid is being canceled from
the primary input. Note the characteristic exponential adaptation of
the LMS, and the nearly instantaneous adaptation of the RLSL.


V.  Conclusion 


This  paper  has  presented  the  design  of  a  real-time  adaptive  filter 
development  system.  The  system  was  custom  built  for  the  purpose 
of  studying  the  performance  of  lattice  algorithms  in  real  world 
environments  and  comparing  them  to  other  adaptive  algorithms.  To 
this  end,  special  hardware  and  software  features  were  incorporated 
into the design to maximize both performance and flexibility. In
addition, provisions to allow the observation of filter variables during
adaptation were incorporated to aid analysis. The system is
typically configured as a linear array of programmable engines to
enhance execution speeds. A description of the architecture and
design details was presented along with sample test data.

Figure 10: Photograph of the NOSC Adaptive Lattice
Development System (LDS).

Figure 11: Photograph of the Computation Engine Boards.

Figure 12: LMS Adaptive Noise Cancellation of
a Single Sinusoid.

Figure 13: RLSL Adaptive Noise Cancellation of
a Single Sinusoid.


